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Abstract 

We  have  implemented  a  secure  network  file  system  called 
SUNDR  that  guarantees  the  integrity  of  data  even  when 
malicious  parties  control  the  server.  SUNDR  splits  stor¬ 
age  functionality  between  two  untrusted  components,  a 
block  store  and  a  consistency  server.  The  block  store 
holds  all  file  data  and  most  metadata.  Without  interpret¬ 
ing  metadata,  it  presents  a  simple  interface  for  clients  to 
store  variable-sized  data  blocks  and  later  retrieve  them  by 
cryptographic  hash. 

The  consistency  server  implements  a  novel  protocol 
that  guarantees  close-to-open  consistency  whenever  users 
see  each  other’s  updates.  The  protocol  roughly  consists 
of  users  exchanging  version-stamped  digital  signatures  of 
block  server  metadata,  though  a  number  of  subtleties  arise 
in  efficiently  supporting  concurrent  clients  and  group- 
writable  files.  We  have  proven  the  protocol’s  security  un¬ 
der  basic  cryptographic  assumptions.  Without  somehow 
producing  signed  messages  valid  under  a  user’s  (or  the 
superuser’s)  public  key,  an  attacker  cannot  tamper  with 
a  user’s  files — even  given  control  of  the  servers  and  net¬ 
work.  Despite  this  guarantee,  SUNDR  performs  within 
a  reasonable  factor  of  existing  insecure  network  file  sys¬ 
tems. 

1  Introduction 

Nobody  wants  malicious  attackers  tampering  with  his 
files.  This  basic  and  obvious  fact  underlies  much  of  the 
way  people  manage  data.  Important  file  systems  must  be 
kept  on  secure  servers  in  machine  rooms  to  which  only 
authorized  people  have  access.  Only  highly  trusted  ad¬ 
ministrators  can  perform  mundane  tasks  such  as  backup 
and  hardware  maintenance.  Conversely,  people  must  ad¬ 
dress  the  constant  threat  of  attackers  gaining  administra¬ 
tive  privileges  on  servers.  Otherwise,  a  number  of  readily 
available  “rootkits”  allow  attackers  who  penetrate  a  sys¬ 
tem  to  replace  core  operating  system  utilities  with  altered 
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versions  that  open  back  doors  and  conceal  evidence  of  the 
security  breach. 

Data  security  is  often  viewed  as  the  problem  of  building 
a  better  fence  around  storage  servers — limiting  the  num¬ 
ber  of  people  with  server  access,  disabling  unnecessary 
daemons  that  might  be  remotely  exploitable,  and  staying 
current  with  security  patches  to  minimize  the  window  of 
vulnerability  to  software  flaws.  This  approach  has  two 
drawbacks.  First,  experience  has  shown  in  many  cases 
that  people  do  not  build  high  enough  fences.  Second,  per¬ 
haps  more  important,  high  fences  are  inconvenient;  they 
restrict  the  ways  in  which  people  can  manage  data. 

An  alternative  approach  is  to  reduce  the  security  needs 
of  file  servers.  This  paper  presents  SUNDR,  a  secure  net¬ 
work  file  system  designed  to  do  exactly  this.  Assuming 
only  the  existence  of  digital  signatures  and  a  collision- 
resistant  hash  function,  SUNDR’s  protocol  provably  guar¬ 
antees  the  integrity  and  consistency  of  file  system  data, 
even  when  malicious  parties  entirely  control  the  server. 
Unlike  previous  Byzantine-fault-tolerant  file  systems  [3, 
22]  that  distribute  trust  but  assume  that  a  threshold  frac¬ 
tion  of  servers  are  honest,  SUNDR  assumes  no  on-line 
trusted  parties.  To  tamper  with  a  user’s  files  without  being 
detected,  an  attacker  must  either  compromise  the  user’s 
client  while  the  user  is  logged  in,  or  otherwise  produce 
valid  digital  signatures  under  the  user’s  public  key.  In  par¬ 
ticular,  the  superuser’s  private  signature  key  can  be  stored 
off-line  when  not  in  use,  making  it  extremely  difficult  for 
an  attacker  to  gain  superuser  access  to  the  file  system. 

SUNDR  does  not  currently  address  the  issue  of  storage 
reliability.  Of  course,  an  attacker  can  always  physically 
damage  a  server  or  wipe  its  disks.  However,  SUNDR 
stores  all  long-lived  data  in  an  append-only  block  log. 
Thus,  it  could  use  append-only  storage  [27]  to  gain  re¬ 
silience  to  network  attacks.  Incremental  off-site  backups 
are  also  easily  implementable  to  survive  physical  compro¬ 
mise.  Moreover,  because  SUNDR  does  not  trust  the  file 
server,  after  a  disaster,  any  lost  file  system  data  can  safely 
be  recovered  from  other  untrusted  sources  that  might  have 
the  data,  such  as  clients’  file  caches. 

SUNDR’s  security  model  gives  people  more  options 
for  managing  their  data  than  current  systems.  For  in¬ 
stance,  organizations  can  outsource  data  storage  without 
fear  of  the  server  operators  tampering  with  data.  It  also  of¬ 
fers  a  vast  improvement  over  current  file  system  security. 
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An  attacker  who  compromises  a  SUNDR  server  cannot 
tamper  with  file  contents.  Our  prototype  implementation 
gives  performance  within  a  reasonable  factor  of  the  popu¬ 
lar  NFS  file  system,  making  SUNDR  practical  despite  its 
significantly  increased  security. 

The  next  section  gives  an  overview  of  the  SUNDR  pro¬ 
tocol.  The  following  two  sections  describe  SUNDR’s 
implementation  and  how  we  tuned  the  protocol  to  give 
acceptable  performance.  Section  5  evaluates  the  perfor¬ 
mance  of  our  implementation.  Section  6  discusses  related 
work,  and  Section  7  concludes. 


2  Protocol 

The  SUNDR  protocol  provably  guarantees  the  integrity 
and  consistency  of  file  system  contents,  preventing  un¬ 
trusted  server  operators  or  malicious  parties  who  compro¬ 
mise  a  server  from  undetected  tampering  with  data.  This 
section  gives  an  overview  of  the  protocol  and  its  security 
properties,  and  explains  how  SUNDR’s  semantics  differ 
from  those  of  ordinary  Unix  file  systems.  A  more  detailed 
description  of  the  protocol  and  a  proof  of  its  security  were 
presented  in  [14],  though  the  published  version  of  that  pa¬ 
per  did  not  present  a  protocol  for  group-writable  files. 

At  the  highest  level,  SUNDR  consists  of  a  client,  a 
block  sen’er,  and  a  consistency  sen’er.  The  block  server 
is  responsible  for  storing  and  serving  data.  It  names  each 
data  block  by  a  collision-resistant  cryptographic  hash,  al¬ 
lowing  clients  to  verify  that  every  block  they  retrieve  con¬ 
tains  the  correct  contents.  The  consistency  server  is  re¬ 
sponsible  for  assigning  an  order  to  all  file  system  opera¬ 
tions.  It  employs  a  new  protocol  through  which  clients 
verify  that  other  clients  have  been  presented  with  a  con¬ 
sistent  view  of  the  file  system. 

Every  file  in  a  SUNDR  file  system  belongs  to  either  a 
user  or  a  group.  For  the  purposes  of  this  paper,  a  user  is 
a  principal  logged  into  a  single  SUNDR  client.  (Human 
users  logged  into  multiple  terminals  function  as  separate 
SUNDR  users  in  the  same  SUNDR  group.)  Each  user  has 
a  public  signature  key.  To  change  a  file,  one  must  produce 
a  message  signed  by  the  file’s  owner  or,  for  group-owned 
files,  by  a  member  of  the  group  that  owns  the  file.  As  a 
special  case,  the  superuser  can  sign  any  data  structure. 

All  users  know  each  other’s  public  keys  and  group 
membership.  Currently,  this  information  is  manually  dis¬ 
tributed  to  clients,  though  in  the  future  we  envisage  man¬ 
aging  keys  and  groups  through  superuser-owned  files  in 
the  file  system.  The  server  does  not  have  any  user’s  sig¬ 
nature  key,  and  thus  cannot  make  unauthorized  changes 
to  the  file  system.  Moreover,  the  server  operator  does  not 
need  the  superuser  key.  In  fact,  the  superuser’s  signature 
key  need  not  be  anywhere  on-line  unless  a  privileged  user 
is  accessing  the  file  system  as  an  administrator. 


2.1  Data  structures 

Figure  1  shows  the  basic  SUNDR  data  structures.  Every 
file  is  identified  by  a  (principal,  i-number)  pair,  where 
principal  is  the  user  or  group  that  owns  the  file  and  i- 
number  is  a  per-user  or  per-group  inode  number.  (Un¬ 
like  traditional  file  systems,  SUNDR  allows  files  owned 
by  different  users  to  have  the  same  i-number.)  Directory 
entries  map  file  names  onto  (principal,  i-number)  pairs. 
A  per-principal  data  structure  called  the  i-table  maps  each 
active  i-number  to  a  collision-resistant  SHA-1  [7]  hash  of 
the  file’s  inode.  We  call  this  value  the  file’s  i-hash.  Inodes 
themselves  contain  SHA-1  hashes  of  file  data  blocks  and 
indirect  blocks,  an  approach  taken  from  the  SFSRO  [10] 
read-only  file  system. 

SUNDR  forms  a  cryptographic  hash  tree  [15]  from 
each  i-table.  The  root  of  this  tree  is  called  the  i-handle. 
Given  an  i-handle  and  all  appropriate  intermediary  data 
structures,  one  can  verify  any  block  of  any  file  in  the  i- 
table.  Thus,  securely  retrieving  file  system  contents  boils 
down  to  the  problem  of  first  obtaining  the  latest  i-handle 
of  each  user  and  group,  then  retrieving  any  needed  data 
blocks  by  their  SHA-1  hashes.  The  latter  functionally 
is  conceptually  simple  to  implement  in  the  block  server, 
while  the  former  requires  a  somewhat  complex  consis¬ 
tency  protocol. 

2.2  Consistency  protocol 

Traditional  network  file  systems  provide  close-to-open 
consistency.  If  one  user  modifies  and  closes  a  file,  another 
user  who  subsequently  opens  and  reads  the  file  should  see 
the  data  most  recently  written  there.  Concurrent  opera¬ 
tions  can  be  ordered  at  the  discretion  of  the  server,  but  all 
clients  must  agree  on  the  order  of  any  conflicting  opera¬ 
tions. 

SUNDR  achieves  consistency  by  embedding  i-handles 
in  digitally  signed  version  structures  through  which 
clients  can  verify  that  the  file  system  has  been  deliver¬ 
ing  close-to-open  consistency.  As  long  as  users  see  the 
effects  of  each  other’s  operations,  they  are  guaranteed  to 
be  getting  close-to-open  consistency. 1  There  remains  the 
possibility  of  a  malicious  server  entirely  concealing  some 
users’  actions  from  others,  if  neither  collection  of  users 
expects  anyone  from  the  other  to  have  accessed  the  file 
system.  However,  such  attacks  are  very  likely  to  be  dis¬ 
covered  quickly,  as  once  a  malicious  server  has  “forked” 
the  state  of  the  file  system,  it  can  never  again  allow  de¬ 
ceived  users  to  see  any  evidence  of  each  other’s  on-line 
activity. 

'More  precisely,  if  user  A  sees  the  effect  of  an  operation  user  B 
performed  at  time  t,  then,  at  least  until  t,  A  and  B  had  close-to-open 
consistency  with  respect  to  each  other.  Moreover,  anyone  who  sees  a 
subsequent  operation  by  A  must  also  have  had  close-to-open  consistency 
with  respect  to  B  until  t. 


2 


user  u2’ s 
i-handle 


Figure  1:  Basic  SUNDR  data  structures.  An  i-handle  is  the  root  of  a  hash  tree  containing  a  user  or  group  i-table. 
A  group  i-table  maps  group  inode  numbers  to  user  inode  numbers.  A  user  i-table  maps  a  user’s  inode  numbers  to 
i-hashes.  An  i-hash  is  the  hash  of  an  inode,  which  in  turn  contains  hashes  of  file  data  blocks. 


Version  structures  use  a  technique  somewhat  similar  to 
version  vectors  [4]  to  detect  inconsistencies.  Each  user 
and  group  has  a  version  number.  A  version  structure  con¬ 
tains  the  latest  version  number  of  every  user  and  group 
at  the  time  of  the  file  system  operations  it  reflects.  Two 
version  structures  are  compatible  if  one  contains  at  least 
as  high  a  version  number  as  the  other  for  all  principals. 
When  creating  a  version  structure,  users  increment  both 
their  own  version  numbers  and  the  version  numbers  of  any 
group  whose  i-tables  they  are  modifying.  Both  opens  and 
closes  require  new  version  structures  to  be  computed.  If 
the  server  fails  to  deliver  close-to-open  consistency,  users 
will  detect  this  upon  seeing  each  other’s  incompatible  ver¬ 
sion  structures.  Figure  2  gives  an  example. 

While  the  basic  idea  is  simple,  a  significant  complica¬ 
tion  is  that  under  concurrent  updates,  SUNDR  users  need 
to  compute  and  sign  their  own  version  structures  with¬ 
out  necessarily  having  access  to  all  other  users’  latest  ver¬ 
sion  structures.  To  allow  concurrent  updates,  the  SUNDR 
consistency  protocol  introduces  two  RPCs,  UPDATE  and 
COMMIT,  illustrated  in  Figure  3.  Both  RPCs  contain  mes¬ 
sages  signed  under  users’  public  keys. 

Intuitively,  UPDATE  declares  a  user’s  intent  to  perform 
some  operation  on  the  file  system.  An  honest  server  to¬ 
tally  orders  all  file  system  operations  by  the  time  at  which 
it  receives  their  update  certificate.  UPDATE  returns  the 
precise  state  of  the  file  system  at  the  time  the  correspond¬ 
ing  operation  is  to  have  taken  place.  COMMIT  contains 
the  user’s  version  structure,  certifying  the  order  that  the 
server  has  assigned  to  operations.  Clients  wait  for  and 
verify  users’  signed  version  structures  before  reading  the 
effects  of  any  declared  modifications,  so  as  to  verify  the 
consistency  provided  by  the  server. 

In  more  detail,  when  a  user  makes  a  file  system  call. 


Figure  2:  Example  version  structures.  In  A-C,  users  u± 
and  //,2  both  modify  group  g’s  i-handle  and  increment  its 
version  number.  v,  \  again  modifies  g  in  D,  but  the  mali¬ 
cious  server  conceals  this  update  from  112,  forking  the  file 
system  state.  u2  subsequently  signs  a  version  structure  in¬ 
compatible  with  Mi’s,  which  the  two  users  will  discover 
should  they  ever  again  see  any  of  each  other’s  updates. 
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[system  call  returns  on  client] 

COMMIT 
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Figure  3:  SUNDR  consistency  protocol.  The  client  de¬ 
clares  its  intent  to  fetch  or  modify  file  system  content  with 
a  signed  update  certificate.  The  server  responds  with  the 
latest  version  structure  for  each  i-handle  plus  any  pend¬ 
ing  update  certificates.  The  client  then  computes  new  i- 
handles  and  sends  a  signed  version  structure  to  the  server. 

the  client  first  sends  the  server  a  signed  update  certificate 
including  the  user’s  version  number  in  the  forthcoming 
version  structure  and  a  list  of  changes  or  deltas  the  user 
wishes  to  make  to  the  file  system.  If  the  user  is  simply 
looking  up  a  file  name  in  a  directory  or  opening  a  file 
for  reading,  the  list  of  deltas  is  empty.  Otherwise,  when 
changing  the  state  of  the  file  system,  deltas  may  be  of  four 
types: 

•  Set  file  (user,  \fifi)  to  i-hash  h. 

•  Set  group  file  (group,  i#)  to  (user,  i#). 

•  Set/delete  entry  name  in  directory  (princ., 

•  Pre-allocate  a  range  of  group  i-numbers  (pointing 
them  to  unallocated  user  i-numbers). 

The  effects  of  all  deltas  are  deterministic,  so  that  multiple 
clients  applying  the  same  deltas  to  the  same  i-handles  in 
the  same  order  will  always  produce  the  same  result. 

In  response  to  an  update  message,  the  server  sends  back 
the  latest  signed  version  structure  that  it  has  for  every  i- 
handle,  plus  a  list  of  signed  update  certificates  for  opera¬ 
tions  not  yet  reflected  in  the  version  structure  list.  We  call 
the  collection  of  version  structures  returned  the  version 
structure  list ,  or  VSL,  and  the  list  of  update  certificates 
the  pending  version  list,  or  PVL.  An  operation  with  up¬ 
date  certificate  x  occurs  after  every  operation  reflected  in 
the  VSL  and  PVL  returned  for  x,  and  before  any  operation 
whose  UPDATE  RPCs  return  x  in  the  PVL. 

When  the  client  receives  the  reply  to  its  update  certifi¬ 
cate,  it  checks  the  PVL  for  any  read-after-write  conflicts. 


If  it  is  not  trying  to  read  a  file  that  another  client  is  cur¬ 
rently  in  the  process  of  modifying,  the  client  can  return 
immediately  from  the  file  system  call.  Otherwise,  it  must 
wait  for  any  pending  conflicting  operations  to  move  from 
the  PVL  to  the  VSL.  In  either  case,  the  client  computes  a 
new  version  structure  based  on  the  VSL  and  PVL  it  has  re¬ 
ceived.  This  version  structure  contains  the  user’s  i-handle, 
the  i-handles  of  any  groups  the  user  has  modified,  the  cur¬ 
rent  latest  version  number  of  every  user  and  every  group, 
and  a  list  of  pending  operations  in  the  PVL,  including  for 
each  pending  operation  a  SHA-1  hash  of  its  forthcoming 
version  structure  without  the  i-handles.  When  computing 
group  i-handles,  the  client  must  reflect  any  previous  mod¬ 
ifications  to  the  group  in  the  PVL. 

A  proof  of  the  consistency  protocol  was  presented 
in  [14],  but  is  beyond  the  scope  of  this  paper.  At  a 
high  level,  however,  its  security  follows  from  two  prop¬ 
erties.  First,  once  two  users  have  signed  incompatible 
version  structures,  they  can  never  again  sign  compatible 
ones  without  detecting  the  attack.  Second,  once  a  pend¬ 
ing  operation  has  appeared  in  the  PVL,  it  will  remain  there 
until  the  corresponding  version  structure  (or  a  later  one)  is 
signed  by  the  user  making  the  update.  Thus,  if  server  mis¬ 
behavior  makes  two  users’  version  structures  incompati¬ 
ble,  neither  user  can  complete  any  system  call  that  con¬ 
flicts  with  an  operation  by  the  other  that  was  pending  at 
the  time  of  the  attack.  It  is  this  last  property  that  makes 
it  safe  for  one  user  to  apply  another  user’s  deltas  to  group 
i-tables. 

2.3  Semantics 

SUNDR’s  semantics  differ  from  traditional  Unix  file  sys¬ 
tems  in  several  ways.  The  implementation  currently 
does  not  keep  track  of  time  of  last  access  (“atime”)  for 
files,  as  atime  cannot  be  tracked  securely  by  an  untrusted 
server.  Other  file  times,  including  the  i-node  change  time 
(“ctime”),  are  set  by  clients,  allowing  users  who  hack  their 
client  software  to  set  dimes  of  files  they  own  to  arbitrary 
values.  SUNDR  also  has  no  equivalent  of  the  Unix  “sticky 
bit”  on  directories. 

While  Unix  files  have  both  an  owner  and  a  group, 
group-writable  files  in  SUNDR  do  not  have  a  comparable 
notion  of  owner — the  user  who  last  wrote  the  file  func¬ 
tions  as  its  owner.  The  process  of  changing  the  permis¬ 
sions  on  a  SUNDR  file  also  functions  more  like  making 
a  new  copy  of  the  file;  it  requires  write  permission  on  the 
parent  directory  and  does  not  affect  other  hard  links  to 
the  file.  Fortunately,  SUNDR’s  content-hash  based  block 
store  allows  such  copies  to  be  made  cheaply.  The  owner  of 
a  directory  also  can  delete  any  entries  in  that  directory — 
including  non-empty  directories  to  which  he  or  she  does 
not  have  write  permission. 

SUNDR’s  semantics  were  primarily  driven  by  its  secu- 
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Figure  4:  Implementation  overview 


rity  requirements.  Though  they  depart  significantly  from 
traditional  file  systems  in  some  areas,  the  new  semantics 
offer  several  benefits.  In  particular,  it  is  often  useful  to 
know  which  user  has  last  written  a  group-writable  file. 
Furthermore,  if  the  SUNDR  block  server  implemented 
disk  quotas,  it  could  limit  users  by  the  actual  number  of 
new  blocks  they  write,  whether  to  group-owned  or  user- 
owned  files.  In  Unix,  the  fact  that  users  may  be  unable  to 
delete  subdirectories  of  their  own  directories  is  somewhat 
of  an  annoyance;  giving  them  this  right  does  not  apprecia¬ 
bly  weaken  security,  as  users  anyway  have  the  ability  to 
rename  directories  they  cannot  remove.  Note  that  except 
for  the  ctime  value,  malicious  users  who  violate  the  pro¬ 
tocol  can  only  affect  files  in  ways  they  equivalently  could 
through  the  system  call  interface. 

We  emphasize  also  that  SUNDR’s  focus  is  on  guaran¬ 
teeing  the  integrity  and  consistency  of  data.  The  current 
implementation  does  not  directly  address  the  questions 
of  privacy,  availability,  or  reliability,  but  SUNDR  is  en¬ 
tirely  compatible  with  these  goals.  A  number  of  previ¬ 
ously  developed  cryptographic  storage  techniques  could 
be  applied  to  achieve  some  degree  of  privacy  in  SUNDR, 
e.g.  [12,  1].  Extensive  work  has  also  been  done  on  ensur¬ 
ing  data  availability  in  the  face  of  disk  and  server  failures. 
The  fact  that  STORE  RPCs  can  safely  be  reordered  would 
make  it  easy  to  replicate  the  block  server,  while  the  con¬ 
sistency  protocol  already  replicates  the  PVL  and  and  VSL 
on  clients.  SUNDR  also  facilitates  reliability  by  storing 
long-lived  data  in  an  append-only  log,  making  the  server 
amenable  to  backup  on  write-once  media  and  incremen¬ 
tal  transfer  to  off-site  replicas.  Because  the  server  is  not 
trusted,  a  damaged  file  system  can  be  reconstructed  from 
all  available  sources  of  data,  including  file  caches  on  un¬ 
trusted  clients. 

3  File  system  implementation 

Figure  4  shows  the  overall  architecture  of  the  SUNDR  im¬ 
plementation.  The  client  is  implemented  at  user  level, 
using  a  modified  version  of  the  xfs  device  driver  from 
the  ARLA  [28]  file  system  on  top  of  a  slightly  modified 
FreeBSD  kernel.  Server  functionality  is  divided  between 


a  consistency  server  and  a  block  server.  The  consistency 
server  implements  the  UPDATE  and  COMMIT  RPCs  de¬ 
scribed  in  the  previous  section.  The  only  file  system  data 
it  ever  sees  are  the  i-handles  embedded  in  version  struc¬ 
tures. 

Most  file  system  data  is  stored  on  the  block  server. 
The  block  server  is  a  user-level  program  that  reads  and 
writes  raw  disks  through  character  devices  so  as  to  man¬ 
age  its  own  cache  and  control  the  order  in  which  writes  go 
to  disk.  The  interface  primarily  consists  of  three  RPCs, 
STORE,  RETRIEVE,  and  DECREF.  STORE  stores  a  block 
of  data  at  the  server.  RETRIEVE  takes  a  SHA-1  hash  as 
an  argument  and  returns  a  block  with  that  hash  (assuming 
one  has  been  stored).  DECREF  informs  the  server  that  the 
client  will  no  longer  attempt  to  retrieve  a  particular  SHA- 
1  hash,  so  that  the  server  can  discard  the  corresponding 
block  if  no  one  else  has  stored  it.  A  few  less  fundamental 
RPCs  are  not  shown  in  Figure  4 — for  instance  to  support 
authenticated  communication  so  that  an  attacker  cannot 
impersonate  the  block  server  and  begin  discarding  newly 
stored  blocks. 

The  remainder  of  this  section  describes  the  implemen¬ 
tation  of  the  client  and  consistency  server,  and  explains 
how  we  optimized  the  SUNDR  protocol  to  make  the  file 
system  perform  acceptably.  The  next  section  gives  the 
details  of  our  high-performance  block  server  implementa¬ 
tion. 

3.1  File  system  client 

The  xfs  device  driver  used  by  SUNDR  is  designed  for 
whole-file  caching.  When  a  file  is  opened,  xfs  makes  an 
upcall  to  the  SUNDR  client  asking  for  the  file’s  data.  The 
client  returns  the  identity  of  a  local  file  that  has  a  cached 
copy  of  the  data.  All  reads  and  writes  are  performed  on 
the  cached  copy,  without  further  involvement  of  SUNDR. 
When  the  file  is  closed  (or  flushed  with  fsync),  if  it  has 
been  modified,  xfs  makes  another  upcall  asking  the  client 
to  write  the  data  back  to  the  server.  Several  other  types  of 
upcalls  allow  xfs  to  look  up  names  in  directories,  request 
file  attributes,  create/delete  files,  and  change  metadata. 

As  distributed,  xfs’s  interface  posed  two  problems  for 
SUNDR.  First,  xfs  caches  information,  such  as  local  file 
bindings,  which  it  uses  to  satisfy  some  requests  without 
upcalls.  In  SUNDR,  some  of  these  requests  require  inter¬ 
action  with  the  consistency  server  for  the  security  prop¬ 
erties  to  hold.  We  therefore  modified  xfs  to  invalidate 
its  cache  tokens  immediately  after  getting  or  writing  back 
cached  data,  so  as  to  ensure  that  the  user-level  client  gets 
control  whenever  the  protocol  requires  an  UPDATE  RPC. 

Second,  some  system  calls  that  should  require  only  a 
single  interaction  with  the  SUNDR  consistency  server  re¬ 
sult  in  multiple  kernel  vnode  operations  and  xfs  upcalls. 
For  example,  the  system  call  “stat  ("a/b/c",  &sb)” 
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results  in  three  xfs  GETNODE  upcalls  (for  the  directory 
lookups)  and  one  GETATTR.  The  whole  system  call  should 
require  only  one  UPDATE  RPC.  Yet  if  the  user-level  client 
does  not  know  that  the  four  upcalls  are  on  behalf  of  the 
same  system  call,  it  must  check  the  freshness  of  its  i- 
handles  four  separate  times  with  four  UPDATE  RPCs. 

To  eliminate  unnecessary  RPCs,  we  modified  the 
FreeBSD  kernel  to  keep  a  count  of  the  number  of  invo¬ 
cations  of  system  calls  that  might  require  an  interaction 
with  the  consistency  server.  We  increment  the  counter 
at  the  start  of  every  system  call  that  takes  a  pathname  as 
an  argument  (e.g.,  stat,  open,  readlink,  chdir).  The 
SUNDR  client  memory  maps  this  counter  and  records  the 
last  value  it  has  seen.  If  xfs  makes  an  upcall  that  does  not 
change  the  state  of  the  file  system  and  the  counter  has  not 
changed,  then  the  client  can  use  its  cached  copies  of  all 
i-handles. 

3.2  Protocol  optimizations 

Any  implementation  of  the  SUNDR  protocol  faces  a  num¬ 
ber  of  potential  performance  bottlenecks.  Digital  signa¬ 
tures  can  be  costly  to  compute  and  verify,  yet  are  on  the 
critical  path  for  every  RPC  to  the  consistency  server.  Re¬ 
computing  hash  trees  for  every  file  system  modification  is 
expensive  and  requires  large  amounts  of  bandwidth  and 
storage  from  the  block  server.  Group  i-tables  in  particular 
are  expensive  to  update,  as  different  users  must  transfer 
new  hash  nodes  back  and  forth  through  the  block  server, 
and  a  new  i-table  could  potentially  force  clients  to  flush 
their  name  caches. 

The  first  requirement  for  decent  performance  was  to  re¬ 
duce  the  cost  of  digital  signatures  in  SUNDR.  Our  im¬ 
plementation  does  so  in  three  ways — by  choosing  an  ef¬ 
ficient  signature  algorithm,  by  reducing  the  total  number 
of  signatures  required,  and  by  moving  signatures  and  ver¬ 
ifications  out  of  the  critical  path  wherever  possible.  We 
chose  the  Esign  [18]  signature  algorithm  for  SUNDR  be¬ 
cause  it  is  over  an  order  of  magnitude  faster  than  more 
popular  schemes  such  as  RSA.  While  there  are  no  known 
attacks  on  the  version  of  Esign  we  are  using,  a  flaw  was 
recently  found  in  its  proof  of  security  and  a  better  variant 
proposed  [11]  with  a  correct  proof.  We  intend  to  switch  to 
the  newer  variant.  All  experiments  reported  in  this  paper 
use  2,048-bit  public  keys,  which  require  a  work  factor  of 
well  over  280  to  break  using  the  fastest  known  algorithms. 

To  get  verification  out  of  the  critical  path,  the  consis¬ 
tency  server  also  processes  and  replies  to  an  UPDATE  RPC 
before  verifying  the  signature  on  its  update  certificate.  If 
the  signature  fails  to  verify,  the  server  reverts  any  effects 
of  the  RPC  (before  other  clients  can  see  them)  and  drops 
the  TCP  connection  to  the  forging  client.  This  behavior  is 
acceptable  because  only  a  faulty  client  would  forge  signa¬ 
tures.  With  this  optimization,  the  consistency  server’s  ver¬ 


ification  of  one  signature  overlaps  with  the  client’s  com¬ 
putation  of  the  next. 

The  second  class  of  optimizations  reduces  the  cost  of 
processing  i-table  hash  trees.  First,  because  group  i-tables 
are  particularly  expensive  to  modify,  SUNDR  minimizes 
the  number  of  times  group  i-handles  change  through  a 
level  of  indirection;  group  i-table  entries  map  to  user  i- 
table  entries  which  map  to  i-hashes.  Thus,  a  group  i- 
handle  doesn’t  change  when  a  group-writable  file  or  di¬ 
rectory  is  modified  multiple  times  by  the  same  user,  a 
common  case.  When  creating  group-writable  files,  clients 
pre-allocate  group  i-numbers  in  batches,  pointing  them  to 
unused  ranges  of  user  i-nodes.  256  group  i-numbers  can 
be  pre-allocated  with  a  single  delta  in  an  update  certifi¬ 
cate. 

As  a  further  optimization  for  both  user  and  group  i- 
tables,  SUNDR  avoids  recomputing  i-handles  on  every 
update.  Instead,  it  specifies  a  principal’s  i-table  with 
two  hashes  embedded  in  version  structures — an  i-handle, 
which  is  still  the  root  of  a  hash  tree,  and  a  log  hash ,  which 
is  the  SHA-1  hash  of  a  vector  of  deltas  to  be  applied  to 
state  of  the  hash  tree.  The  COMMIT  RPC  takes  a  log  as 
well  as  a  version  structure,  and  the  UPDATE  RPC  returns 
a  log  for  each  i-handle.  Clients  only  periodically  compute 
new  i-handles  to  clear  the  log  hash. 

Log  hashes  have  a  number  of  benefits.  They  amortize 
computation  of  new  i-handles  over  many  updates.  They 
save  bandwidth  to  the  block  server  by  requiring  fewer 
hash  tree  blocks  to  be  stored.  For  group  i-tables,  they  al¬ 
low  one  client  to  apply  another’s  update  certificate  deltas 
without  fetching  any  blocks  from  the  server.  Most  im¬ 
portantly,  they  allow  clients  to  validate  cached  informa¬ 
tion  without  fetching  i-table  blocks  from  the  server.  Both 
name  cache  entries  and  cached  file  contents  are  valid  so 
long  as  the  relevant  i-handles  haven’t  changed  and  the  par¬ 
ticular  file  is  not  affected  by  any  deltas  in  the  hashed  log. 

3.3  Future  optimizations 

Finally,  we  mention  several  optimizations  that  we  would 
like  to  implement  but  have  not  yet.  To  reduce  the  num¬ 
ber  of  signatures  required,  the  SUNDR  client  might  avoid 
signing  version  structures  by  eliding  COMMIT  RPCs. 
(This  is  safe,  as  from  other  clients’  point  of  view  an  omit¬ 
ted  COMMIT  is  equivalent  to  two  COMMITS  arriving  back- 
to-back  at  the  server.)  After  completing  an  UPDATE  RPC, 
the  client  can  return  immediately  to  xfs,  delaying  com¬ 
putation  of  the  new  version  structure.  If,  within  a  small 
number  of  milliseconds,  no  further  UPDATE  RPCs  are  re¬ 
quired,  the  client  can  then  sign  a  version  structure  and 
make  a  COMMIT  RPC.  If,  on  the  other  hand,  the  client  im¬ 
mediately  makes  another  UPDATE  RPC,  it  can  avoid  sign¬ 
ing  the  first  version  structure  or  making  the  first  COMMIT 
RPC.  Under  workloads  such  as  unpacking  tar  archives. 


6 


this  optimization  might  save  the  signing  cost  of  several 
consecutive  version  structures. 

Bandwidth  to  the  consistency  server  could  also  be  con¬ 
siderably  reduced.  UPDATE  RPCs  currently  return  the  en¬ 
tire  VSL  and  PVL.  However,  a  client  will  generally  al¬ 
ready  have  many  of  the  entries  in  these  lists.  The  consis¬ 
tency  server  should  therefore  return  only  VSL  and  PVL 
entries  that  changed  since  the  last  UPDATE  reply  it  has 
sent  to  a  particular  client.  Similarly,  both  UPDATE  replies 
and  COMMIT  RPCs  currently  send  the  entire  delta  log  with 
each  i-handle,  when  only  new  entries  need  be  sent.  Even 
version  structures  need  not  be  sent  in  their  entirety,  as 
most  of  their  content  other  than  i-handles  is  implicit  in  the 
order  of  PVL  entries  and  the  groups  they  affect.  On  sys¬ 
tems  with  many  users,  the  messages  exchanged  with  the 
consistency  server  should  be  proportional  in  size  to  the 
number  of  active  users  and  groups,  while  currently  they 
are  proportional  to  the  total  number  of  principals. 

Another  class  of  possible  optimizations  target  the  block 
server.  SUNDR  currently  breaks  files  into  fixed-size  8K 
blocks,  stores  the  blocks  at  the  block  server,  and  puts  their 
SHA-1  hashes  into  inode  and  indirect  block  data  struc¬ 
tures.  Because  the  block  server  handles  variable-sized 
blocks  efficiently,  the  client  could  compress  blocks  be¬ 
fore  sending,  saving  both  network  bandwidth  and  storage 
at  the  server.  Moreover,  recent  work  [17]  suggests  that 
by  breaking  files  into  carefully-constructed  variable-sized 
blocks,  one  can  increase  the  number  of  blocks  common 
across  files.  Such  commonality  would  reduce  the  number 
of  block  transfers  needed,  save  space  at  the  block  server, 
and  furthermore  increase  the  effectiveness  of  the  block 
server’s  cache. 


4  Block  server  implementation 

In  SUNDR,  all  data  and  most  metadata  are  stored  on  and 
retrieved  from  the  block  server,  called  bsrv.  bsrv  presents 
clients  with  an  extremely  simple  interface.  Clients  send 
the  server  data  blocks  and  later  retrieve  them  by  SHA-1 
hash.  Since  the  client  actually  implements  most  file  sys¬ 
tem  functionality,  the  block  server  cannot  interpret  or  pri¬ 
oritize  blocks.  Moreover,  the  server  does  not  have  any 
user  signature  keys,  and  thus  does  not  even  have  permis¬ 
sion  to  repair  the  file  system  should  something  go  wrong. 
Thus,  while  traditional  file  systems  write  many  types  of 
data  asynchronously,  crash  recoverability  requires  that 
bsrv  write  every  block  to  stable  storage  before  returning  to 
the  client.  In  addition,  block  server  performance  is  critical 
to  good  file  system  performance  in  SUNDR. 

4.1  Background:  Venti 

bsrv  is  certainly  not  the  first  content-hash-based  block 
server.  In  fact,  bsrv' s  design  takes  its  inspiration  from 
the  Venti  [19]  block  server.  Venti  appends  variable-sized 
blocks  sequentially  to  a  large  log  disk  or  RAID  array,  and 
then  indexes  them  by  SHA-1  hash  on  a  fast  SCSI  disk. 
An  index  entry  is  stored  in  a  logical  block  on  disk  de¬ 
termined  entirely  by  the  SHA-1  hash.  Overflow  entries 
in  Venti  spill  over  onto  the  next  block.  Venti  provides 
archival  storage,  i.e.,  it  never  deletes  blocks.  The  authors 
argue  that  it  nonetheless  can  be  used  to  build  a  regular  file 
system,  because  storage  capacities  are  increasing  faster 
than  users  can  generate  data.  We  agree  with  this  assess¬ 
ment,  and  have  likewise  designed  bsrv  to  be  an  archival 
block  server.  In  the  future,  we  envisage  support  for  snap¬ 
shots,  allowing  SUNDR  users  to  access  previous  versions 
of  the  file  system.  Archival  storage  would  easily  allow 
such  functionality. 

Whereas  we  follow  Venti’s  general  design,  we  argue 
that  it  has  drawbacks  that  make  it  ill-suited  for  SUNDR. 
In  particular  (1)  it  buffers  writes  so  it  cannot  guaran¬ 
tee  data  stability  (2)  it  is  strictly  archival  and  hence 
would  store  many  short-lived  blocks  such  as  i-table  hash 
tree  nodes  in  perpetuity  and  (3)  it  can  achieve  only  5- 
7  MBytes/sec  of  write  throughput,  while  we  desire  a  block 
server  capable  of  absorbing  data  at  fast  Ethernet  rates. 
To  overcome  these  problems,  bsrv  introduces  a  temporary 
log  to  Venti’s  original  design. 

4.2  bsrv  architecture 

Figure  5  shows  bsrv' s  general  architecture.  The  block 
server  requires  at  least  three  disks,  a  large  log  disk  (or 
RAID  array),  a  small  temporary  log  disk,  and  one  or  more 
index  disks.  As  an  example  configuration,  this  paper  re¬ 
ports  on  a  server  that  uses  high-capacity  IDE  disks  for  the 
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log  and  temporary  log,  while  striping  the  index  over  four 
fast  SCSI  disks. 

bsrv  accesses  disks  through  the  raw  character  device 
interface.  Because  Unix  devices  have  a  blocking  inter¬ 
face,  it  uses  a  number  of  helper  processes  to  issue  con¬ 
current  I/O  requests,  thereby  keeping  multiple  disks  busy 
and  enjoying  better  disk  scheduling  through  SCSI  tagged 
command  queuing  on  the  index  disks,  bsrv  spawns  four 
instances  of  the  helper  program,  called  aiod,  for  each  disk 
on  the  system.  It  communicates  with  them  through  large 
regions  of  shared  memory.  Each  group  of  four  aiod s  also 
listens  on  a  pipe.  To  initiate  an  I/O,  bsrv  writes  to  the  ap¬ 
propriate  pipe  an  8-byte  pointer  to  a  command  block  in 
shared  memory.  When  the  I/O  is  complete,  aiod  writes 
the  status  into  the  command  block  and  notifies  bsrv  by 
sending  the  command  block’s  address  back  over  a  Unix- 
domain  socket.  In  order  to  avoid  waking  up  four  kernel- 
level  threads  each  time  bsrv  writes  to  a  pipe,  the  aiod  pro¬ 
cesses  use  an  flock  lockfile  to  ensure  only  one  process  at 
a  time  blocks  reading  a  given  pipe,  (flock  is  specifically 
implemented  to  avoid  waking  threads  up  unnecessarily.) 
The  bsrv  interface  consists  of  the  following  five  RPCs: 

STORE  ( header.block ) 

RETRIEVE  (hash) 

DECREF  (hash) 

CSTORE  (hash) 

PSTORE  ( header ,  block) 

The  STORE  RPC  takes  a  data  block  preceded  by  a  block 
header  and  writes  them  both  to  stable  storage  if  the  server 
does  not  already  have  a  copy  of  the  block.  The  header  has 
information  encapsulating  the  owner  and  creation  time  of 
the  block,  as  well  as  fields  that  could  be  used  to  store 
the  encoding  method  and  decoded  length  of  compressed 
blocks.  The  RETRIEVE  RPC  retrieves  a  block  from  the 
server  given  its  SHA-1  hash.  It  also  returns  the  first  header 
STOREd  with  the  particular  block. 

DECREF  informs  the  server  that  a  block  with  a  particu¬ 
lar  SHA-1  hash  is  no  longer  needed  and  can  be  discarded. 
bsrv' s  dereferencing  semantics  are  conservative.  When 
a  block  is  first  STOREd,  the  server  establishes  a  short 
window  (one  minute  by  default)  during  which  it  can  be 
deleted.  If  a  client  STORES  then  DECREFs  a  block  within 
this  window,  the  server  marks  the  block  as  garbage  and 
does  not  permanently  store  it.  If  two  clients  store  the 
same  block  during  the  dereference  window,  the  block  is 
also  marked  as  permanent.  Once  a  block  has  been  marked 
permanent,  it  can  never  be  deleted  from  the  block  server. 

Two  more  RPCs  exist  that  are  not  currently  used  by 
the  SUNDR  client.  CSTORE,  short  for  “conditional  store,” 
functions  like  a  read  that  doesn’t  return  the  data — it  sim¬ 
ply  tells  the  client  whether  or  not  the  server  is  storing 
a  block  with  a  particular  SHA-1  hash.  In  addition,  any 
CSTOREd  block  is  automatically  marked  permanent.  A 


client  could  use  this  RPC  to  save  bandwidth  when  storing 
blocks  that  are  likely  to  exist  already  at  the  server. 

Internally,  the  block  server  has  a  function  pstore  (“per¬ 
manent  store”)  that  bypasses  the  temporary  log.  We  ex¬ 
pose  the  pstore  function  through  a  PSTORE  RPC  for  the 
purpose  of  testing  and  comparison.  PSTORE  functions  like 
a  store  followed  by  a  sync  in  Venti,  and  thus  is  useful  for 
quantifying  the  benefits  of  bsrv' s  temporary  log.  The  RPC 
is  not  intended  to  be  called  by  clients. 

We  now  describe  how  bsrv  uses  its  disks. 

4.3  Temporary  Log 

When  a  client  issues  a  STORE  RPC,  the  server  will  im¬ 
mediately  store  the  given  block  to  the  temporary  log  sub¬ 
system.  In  our  implementation,  this  subsystem  comprises 
a  small  raw  partition  on  an  IDE  disk,  and  an  in-memory 
cache  that  mirrors  the  contents  of  the  temporary  log.  By 
the  time  it  responds  to  the  RPC,  the  server  has  either  writ¬ 
ten  the  block  to  the  temporary  log  disk  or  located  a  copy  of 
the  block  on  a  disk  in  the  system,  thus  guaranteeing  dura¬ 
bility  in  the  event  of  a  crash.  Keeping  a  redundant  copy 
of  the  block  in  memory  assures  us  that  we  need  to  read 
from  the  temporary  log  only  to  support  crash  recovery. 
Assuming  the  temporary  log  disk  does  not  serve  requests 
generated  by  other  processes,  it  need  rarely  seek.  Rather, 
an  overwhelming  portion  of  disk  traffic  will  be  sequential 
writes,  thus  enjoying  the  same  performance  characteris¬ 
tics  as  a  database  log. 

We  should  note  that  all  writes  to  the  temporary  log  are 
aligned  along  sector  boundaries.  This  wastes,  on  average, 
half  a  sector  of  disk  space  per  block.  The  space  overhead 
is  of  little  concern  given  that  the  server  continually  recy¬ 
cles  the  temporary  log  and  that  the  temporary  log  is  very 
small  relative  to  the  capacity  of  modem  IDE  disks.  Align¬ 
ing  writes  to  sector  boundaries  avoids  the  need  for  succes¬ 
sive  writes  of  the  same  sector,  which  the  operating  system 
cannot  issue  concurrently  and  which  therefore  cost  a  disk 
rotation. 

4.4  Permanent  Log 

The  server  periodically  processes  the  temporary  log,  ag¬ 
ing  out  those  blocks  that  have  exceeded  their  dereference 
window.  If  a  block  has  been  dereferenced  but  not  accessed 
in  any  other  way,  the  server  will  delete  the  block  from  the 
temporary  log,  being  certain  not  to  enter  the  block  into 
any  of  the  various  caching  systems.  Otherwise,  the  server 
pstore s  the  block,  pstore  will  write  a  block  to  the  per¬ 
manent  log  provided  that  the  block  does  not  already  exist 
on  the  server.  Blocks  are  organized  in  the  permanent  log 
sequentially  along  4-byte  boundaries  to  optimize  storage 
density.  Once  a  block  is  written  to  the  permanent  log,  it  is 
read-only  and  immutable. 


Since  the  server  processes  the  temporary  log  in  batches, 
it  will  write  blocks  to  the  permanent  log  in  batches.  For 
a  batch  that  consists  of  n  blocks,  the  server  will  usually 
require  only  nb/m  sequential  writes  to  the  log  disk,  where 
b  is  the  average  block  size,  and  to  is  the  maximum  write 
size.  On  our  system,  b  =  8K  and  to  =  64K,  thus  we  need 
only  write  to  disk  once  every  8  blocks. 

Our  server  uses  a  large  raw  IDE  hard  disk  for  its  perma¬ 
nent  log.  The  log  subsystem  also  includes  an  LRU  cache. 
We  store  blocks  into  this  cache  if  they  have  been  redun¬ 
dantly  stored.  A  redundant  store  is  a  good  cue  that  a  block 
is  common  to  multiple  files;  caching  these  blocks  can  save 
random  seeks  on  the  permanent  log  IDE  drive  during  RE¬ 
TRIEVES. 

4.5  Index 

The  index  subsystem  serves  to  locate  blocks  on  the  per¬ 
manent  log,  keyed  by  their  SHA-1  hashes.  An  ideal  index 
would  be  a  simple  in-memory  hash  table  mapping  20-byte 
SHA-1  hashes  to  an  8-byte  disk  offset.  If  we  assume  that 
the  average  block  stored  on  the  system  is  8K,  we  see  that 
the  index  must  have  roughly  1/128  the  capacity  of  the  log 
disk.  Although  at  present,  such  a  ratio  of  disk  to  mem¬ 
ory  is  obtainable  with  commodity  components,  we  are  not 
convinced  that  memory  will  be  able  to  maintain  such  a  ra¬ 
tio  with  hard  disks  in  the  future. 

Instead,  we  adopt  Venti’s  strategy  of  implementing  the 
index  as  a  disk-resident  hash  table,  striped  over  an  array  of 
high-speed  SCSI  disks.  We  hash  20-byte  SHA-1  hashes 
down  to  ( index-disk-id ,  index-disk-offset )  pairs.  The  disk 
offsets  point  to  sector-sized  on-disk  data  structures  called 
buckets,  which  contain  15  index-entries ,  sorted  by  SHA-1 
hash,  index-entries,  in  turn  map  SHA-1  hashes  to  offsets 
on  the  permanent  data  log. 

The  server  might  access  the  index  subsystem  during 
PSTORE,  CSTORE,  and  RETRIEVE.  Assuming  the  server 
has  not  found  the  requested  block  in  the  temporary  log  or 
the  block  LRU  cache,  it  will  look  up  the  block’s  SHA- 
1  hash  in  the  index.  If  it  finds  the  hash  in  the  index,  it 
will  trigger  an  immediate  RPC  response  to  PSTORE  and 
CSTORE.  In  the  case  of  RETRIEVE,  it  will  then  access 
the  index-entry  to  find  the  block’s  location  on  the  perma¬ 
nent  log,  read  and  then  return  the  block  to  the  client.  If 
the  server  does  not  find  the  hash  in  the  index,  it  will,  in 
the  case  of  PSTORE,  write  the  block  to  the  permanent  log 
and  then  update  the  index  to  reflect  where  the  block  was 
stored,  using  quadratic  hashing  when  index  blocks  over¬ 
flow.  In  the  case  of  CSTORE,  the  server  will  respond  false 
to  the  RPC.  In  the  case  of  RETRIEVE,  the  server  will  reply 
that  the  block  was  not  found. 

As  the  server  receives  CSTORE,  RETRIEVE  and  PSTORE 
RPCs  from  the  client,  it  should  expect  a  random  order¬ 
ing  of  SHA-1  hashes.  Thus,  each  RPC  will  require  one 


random  seek  across  an  index  disk,  and  a  PSTORE  that  is 
storing  a  block  not  already  on  the  system  will  need  two. 
We  use  caching  and  striping  to  mitigate  the  effects  of  this 
bottleneck.  After  the  server  accesses  or  updates  an  index- 
entry,  it  will  store  it  in  an  LRU  cache.  Furthermore,  strip¬ 
ing  the  index  across  multiple  disks  allows  concurrency 
among  lookups  and  limits  the  amount  of  space  needed  on 
each  disk,  hence  shortening  the  range  over  which  the  in¬ 
dex  disks  heads  need  to  seek. 

Unlike  Venti,  we  do  a  large  majority  of  our  stores  via 
STORE  and  the  temporary  log;  therefore,  we  can  make  a 
much  more  efficient  use  of  the  index  subsystem.  Before 
moving  a  batch  of  blocks  from  the  temporary  log  to  the 
permanent  log,  the  server  sorts  them  by  index-disk-offset. 
That  is,  we  map  each  block  to  a  SHA-1  hash,  each  SHA-1 
hash  to  a  bucket  offset,  and  sort.  As  a  result,  the  index 
disk  arms  can  service  a  whole  batch  of  index  reads  in  one 
sweep.  We  use  the  same  technique  for  scheduling  lazy- 
wri  tes  to  the  index  disk  after  we  have  written  a  block  to  the 
permanent  log  disk.  Despite  the  sorting  of  block  batches, 
the  server  ensures  that  blocks  are  stored  on  the  log  in  the 
order  in  which  they  arrived.  Randomly  reordering  blocks 
would  be  detrimental  to  read  performance  for  large  multi¬ 
block  files. 

4.6  Crash  Recovery 

To  recover  from  a  crash  or  an  unclean  shutdown,  the  sys¬ 
tem  first  assures  the  consistency  of  the  permanent  log  and 
index  subsystems.  Because  the  server  updates  the  index 
lazily  after  storing  blocks  to  the  permanent  log,  a  fair 
number  of  unwritten  index  entries  and  hence  out-of-date 
buckets  may  exist.  Finally,  bsrv  process  the  temporary  log 
by  pstore ing  all  its  blocks  to  the  permanent  log. 

4.7  Design  Discussion 

Our  system  differs  from  Venti  mainly  in  our  use  of  a  tem¬ 
porary  log  subsystem  and  our  preference  of  STORE  over 
PSTORE.  This  architecture  conveys  several  advantages 
in  our  application  context: 

Durability.  When  a  close  command  returns  in  our 
case,  the  blocks  of  that  file  are  on  disk,  albeit  usually  on 
the  temporary  log. 

Dereferencing.  The  temporary  log  allows  our  server 
to  dereference  blocks  within  a  minute  of  their  creation. 
Because  many  files  are  short-lived,  this  saves  space  on  the 
permanent  log  and  eliminates  disk  accesses  to  both  the 
index  disk  and  the  permanent  log  disk. 

Fast  Log  Alignment.  The  temporary  log  allows  us  to 
write  blocks  continuously  when  copying  to  the  permanent 
log.  This  saves  space  but  also  time.  For  example,  suppose 
a  system  has  512-byte  sectors  and  an  application  writes 
two  1000  byte  blocks.  In  order  to  achieve  good  packing. 
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a  Venti-style  system  would  have  to  do  two  writes  to  the 
second  sector,  at  some  cost  in  rotation  time.  Ours  requires 
only  one. 

Index  Disk  Arm  Scheduling.  Writing  blocks  to  the 
permanent  log  in  large  groups  allows  us  to  sort  our  ac¬ 
cesses  to  the  index  disk  thus  minimizing  the  length  of 
seeks. 

The  net  result  is  the  semantic  benefit  of  durability  with 
file-system-like  throughput  rates  (e.g.,  UMB/sec),  a  sig¬ 
nificant  improvement  on  Venti. 

5  Performance 

Our  primary  goal  in  testing  SUNDR  was  to  ensure  that  its 
security  benefits  do  not  come  at  too  high  a  price.  In  this 
section,  we  compare  SUNDR’s  overall  performance  to 
NFS.  We  also  perform  microbenchmarks  to  help  explain 
our  application-level  results,  and  to  support  our  claims 
that  our  block  server  outperforms  a  Venti-like  architecture 
in  our  setting. 

5.1  Experimental  Setup 

We  carried  out  our  experiments  on  three  3GHz  Pen¬ 
tium  IVs  running  FreeBSD  4.7.  All  machines  are 
connected  with  100Mbit,  full-duplex,  switched  Ether¬ 
net.  We  measured  TCP  throughput  between  clients  at 
11.21  MBytes/sec,  and  round-trip  times  at  0.110  msec. 
All  machines  have  3GB  of  RAM  and  Seagate  Cheetah 
18GB  SCSI  hard  drives,  which  spin  at  15K  RPM.  We  used 
four  of  these  drives  in  our  block  server  as  index  disks,  and 
added  2  Western  Digital  Caviar  180GB  7200  RPM  EIDE 
hard  drives  for  temporary  and  permanent  log  disks. 

Our  cryptographic  implementations  use  GNU  Multi- 
precision  Library  version  4.1.2  for  large  integer  arith¬ 
metic. 

5.2  Microbenchmarks 

We  perform  a  series  of  microbenchmarks  on  both  the 
SUNDR  block  server,  the  SUNDR  client,  and  the  com¬ 
bined  system. 

5.2.1  SUNDR  Block  Server 

Our  SUNDR  block  server  currently  can  store  up  to  175GB 
of  data,  but  in  our  experimental  setup,  we  partitioned  our 
index  disks  so  that  we  could  comfortably  accommodate 
up  to  2TB  of  data.  That  is,  for  every  8K  block  stored  on 
the  server,  we  require  approximately  64  bytes  of  space  on 
the  index  disks.  An  index  entry  only  takes  up  around  32 
bytes  of  disk  space,  but  we  would  expect  performance  to 
degrade  if  the  index  becomes  more  than  half  full.  Thus, 
our  index  array  must  have  a  capacity  of  24126/213  or 
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16GB  to  meet  out  storage  needs.  Given  our  4  disk  index 
array,  we  need  only  use  4GB  of  each  disk  to  accommodate 
the  index. 

We  did  not  heavily  tune  the  block  server,  choosing 
its  parameters  somewhat  arbitrarily.  We  allow  SUNDR 
clients  to  make  up  to  40  outstanding  RPCs.  This  is  both 
for  flow  control  and  fairness.  We  set  our  dereference 
window  to  60  seconds,  and  for  the  purposes  of  the  mi¬ 
crobenchmarks,  we  tinned  off  the  block  cache.  Through¬ 
out,  we  enable  an  index  entry  cache  of  up  to  100,000  en¬ 
tries.  Our  temporary  log  is  720MB,  circularly  processed, 
and  never  filled  up  in  the  course  of  experiments.  Over 
100  Mbit  Ethernet,  the  SUNDR  block  server  can  process 
the  temporary  log  as  fast  as  blocks  can  be  sent  over  the 
network. 

Figure  6  shows  our  measurements  of  the  block  server’s 
throughput  and  latency.  Recall  that  if  we  were  to  simply 
use  Venti’s  architecture  and  did  not  have  the  benefit  of 
the  temporary  log,  we  would  use  PSTORE  and  not  STORE. 
These  results  legitimize  our  improvements:  in  terms  write 
throughput,  STORE  outperforms  PSTORE  by  106%.  For 
batches  of  data  that  can  be  transmitted  in  an  amount  of 
time  less  than  the  dereference  window,  the  write  through¬ 
put  is  limited  only  by  the  network  bandwidth.  For  very 
large  block  batches,  the  server  will  start  to  process  the 
temporary  log  while  the  client  is  still  transmitting  data. 
The  server  can  processes  the  temporary  log  from  its  in¬ 
memory  memory  at  11.72  MBytes/s.  However,  the  op¬ 
eration  places  demands  on  the  CPU,  thereby  reducing 
the  bandwidth  observed  by  SUNDR  clients.  We  see  a 
17%  decrease  in  write  throughput  while  the  server  is  pro¬ 
cessing  the  temporary  log.  Even  under  these  conditions, 
STORE  still  outperforms  PSTORE  by  72%. 

In  terms  of  write  latency,  STORE  performs  significantly 
better  than  PSTORE,  responding  to  RPCs  in  23%  of  the 
time  it  takes  for  a  PSTORE  under  quiescent  conditions,  and 
29%  when  simultaneously  processing  the  temporary  log. 

Our  block  server  does  not  perform  well  when  asked  to 
RETRIEVE  randomly  scattered  blocks.  Note  that  the  In¬ 
dex  LRU  cache  is  of  little  help.  The  bottleneck  is  the  IDE 
log  disk,  which  is  slow  to  make  random  seeks;  Venti  has 
a  similar  bottleneck.  In  the  context  of  SUNDR,  slow  ran¬ 
dom  RETRIEVES  should  not  affect  overall  system  perfor¬ 
mance  if  the  client  aggressively  caches  blocks  and  reads 
large  files  sequentially. 
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Figure  6:  Throughput  and  latencies  for  the  SUNDR  block  server.  Throughputs  were  collected  over  10,000  asyn¬ 
chronous  operations  on  8K  blocks,  with  40  outstanding  RPCs.  Latencies  were  collected  over  5,000  back-to-back 
synchronous  operations  on  8k  blocks.  The  block  cache  was  cold  in  all  experiments. 
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Figure  7:  From  the  client’s  perspective,  a  timeline  of  procedure  calls  for  creating,  reading,  and  unlinking  a  IK  file  on 
SUNDR,  in  msecs.  Bars  represent  procedure  call  durations;  they  begin  when  a  procedure  was  called  and  end  when  it 
returns.  Calls  on  the  left  side  of  the  axes  are  RPCs  to  the  block  server.  Calls  on  the  right  side  are  involved  with  file 
system  consistency,  and  are  either  RPCs  to  the  consistency  server  or  local  signatures. 
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5.2.2  Client  Profiling 

We  next  performed  a  high-level  profile  of  the  SUNDR 
client,  so  as  to  better  understand  our  other  micro  and  mac¬ 
robenchmarks.  Figure  8  shows  the  performance  of  our 
underlying  signature  scheme  in  comparison  to  the  more 
conventional  Rabin  scheme  [20].  We  use  2,048  bit  Esign 
keys  in  SUNDR,  which  can  be  broken  with  a  work  factor 
greater  than  280.  In  other  words,  an  adversary  with  infinite 
storage  could  more  easily  find  a  SHA1  hash  collision  than 
break  a  2,048  bit  Esign  key.  We  also  provide  benchmarks 
for  6,000  bit  keys,  which  can  be  broken  with  a  work  factor 
of  2128  [11], 

Figure  7  provides  a  representation  of  how  SUNDR’s 
RPCs  and  signatures  correspond  with  Unix  system  calls. 
The  three  cases  shown  correspond  to  file  phases  of  the 
LFS  small  file  benchmark  [23].  The  timings  were  col¬ 
lected  by  inserting  gettimeofday  system  calls  in  the  code, 
and  thus  reflect  a  small  amount  of  time  dilation  from  mea¬ 
surement  overhead. 

The  create  phase  calls  the  open,  write  and  close  system 
calls.  This  figure  verifies  that  close  can  only  return  once 
its  STORE  call  to  the  block  server  has  returned — hence 
the  importance  of  minimizing  the  latency  of  RPC  STORE 
calls.  The  write  call  does  not  involve  any  remote  RPCs  or 
cryptographic  operations;  for  this  reason  it  is  not  shown 
in  Figure  7. 

The  read  phase  makes  open,  read  and  close  calls.  In 
this  experiment,  almost  all  of  the  expensive  operations  are 
handled  in  the  open  call.  Since  xfs  does  not  currently  sup¬ 
port  chunked  file  retrieves,  SUNDR  fetches  the  entirety  of 
the  file  at  the  open  call.  However,  our  experiments  assume 
a  warm  client-side  cache,  so  no  RETRIEVE  calls  actually 
went  through  to  the  server.  Finally,  the  unlink  phase  con¬ 
sists  only  of  a  system  call  to  unlink. 

Figure  7  shows  our  prototype’s  efforts  to  interleave 
RPCs  and  to  delay  those  not  on  the  critical  path  until  after 
returning  to  the  user-level  caller.  We  see  that  system  calls 
do  not  wait  for  COMMIT  RPCs  to  return.  Moreover,  even 
in  small  writes,  the  client  sends  data  blocks  to  the  block 
store  concurrently,  and  waits  for  them  to  return  only  in 
the  case  of  close  and  fsync.  This  increases  concurrency 
without  compromising  our  consistency  guarantees. 

5.2.3  LFS  Small  File  Benchmark 

Next,  we  ran  our  system  using  the  LFS  small  file  bench¬ 
mark  and  measured  end-to-end  performance.  Before  do¬ 
ing  so,  we  made  a  slight  adjustment  to  the  benchmark 
suite,  and  randomized  the  contents  of  the  files  written  to 
disk.  To  write  identical  files  would  give  SUNDR’s  block 
store  architecture  an  unfair  advantage,  as  only  one  unique 
file  data  block  would  ever  need  to  be  stored.  In  this  and 
other  benchmarks,  we  compare  SUNDR  to  NFSv2  and 
NFSv3.  Indeed,  NFSv3  offers  better  performance,  but 


Figure  9:  LFS  Small  File  Benchmark.  1000  operations  on 
files  with  IK  of  random  content. 


NFSv2’s  semantics  guarantee  that  a  client’s  data  has  been 
written  to  stable  storage  on  close.  Since  SUNDR  offers 
the  same  guarantee,  we  believe  a  comparison  to  NFSv2  is 
a  fairer  one;  we  include  results  for  NFSv3  regardless. 

Figure  9  details  our  results.  SUNDR  significantly  out¬ 
performs  NFSv2  in  the  create  and  unlink  stages  of  the 
experiment — by  30%  and  18%  respectively.  SUNDR  is 
noticeably  slower  in  the  read  phase,  primarily  because  of 
latency  introduced  by  the  consistency  protocol.  In  partic¬ 
ular,  open  must  wait  for  UPDATE  to  return  before  returning 
to  the  caller. 


5.3  Application-Level  Benchmarks 

Figure  10  shows  SUNDR’s  performance  in  copying, 
uncompressing,  untarring,  configuring,  compiling,  in¬ 
stalling,  and  removing  the  Apache  Webserver  distribution 
(apache_l  .3.27  .tar . gz).  We  first  note  that  during  the 
experiment,  bsrv  processed  16,872  8K  data  blocks,  of 
which  8,023  (48%)  were  dereferenced  before  they  could 
be  committed  to  the  permanent  log.  The  configure,  com¬ 
pile  and  untar  operations  in  particular  involve  many  tem¬ 
porary  files  and  sequential  changes  to  filesystem  meta¬ 
data.  Thus,  in  these  benchmarks,  clients  frequently  DE- 
CREF  blocks  within  milliseconds  of  STOREing  them. 

SUNDR  is  competitive  with  NFSv2  in  copying  the 
source  archive  from  local  disk  to  the  SUNDR  partition;  it 
completes  this  operation  in  110  msecs,  while  NFSv2  does 
the  same  in  87  msecs.  NFSv3  actually  performs  slower  at 
this  stage,  completing  the  cp  in  199  msecs.  SUNDR’s  per¬ 
formance  is  similar  to  NFSv2’s  in  the  unzip,  untar,  com¬ 
pile  and  cleanup  portions  of  the  build  process.  The  config 
and  install  segments  clearly  present  a  problem  for  us.  We 
see  several  explanations  for  this.  Most  intuitively,  many 
operations  in  these  procedures  reference  the  file  system 
three  times:  once  to  load  an  executable  shell  script  in  the 


12 


Figure  10:  Installation  procedure  for  apache_l  .3.27.  tar .  gz. 


build  directory,  once  to  read  an  input  file,  and  once  to  store 
to  an  output  file.  At  the  very  least,  we  are  experiencing  a 
two-fold  magnification  of  the  read  bottleneck  seen  in  the 
LFS  small  file  benchmark.  Moreover,  our  current  imple¬ 
mentation  of  SUNDR  does  not  handle  concurrent  xfs  up- 
calls  optimally.  Our  prototype  sometimes  serializes  these 
requests  for  convenience. 

In  sum,  we  are  encouraged  by  these  results.  Our  exper¬ 
iments  have  shown  SUNDR  to  be  almost  as  fast  as  totally 
insecure  network  file  systems  when  writing  large  amounts 
of  data  and  files  to  the  file  system.  We  believe  that  our  ap¬ 
proach  is  fundamentally  sound,  and  that  future  optimiza¬ 
tions  and  design  improvements  will  make  SUNDR  more 
competitive  in  all  of  these  benchmarks. 

6  Related  Work 

Recently,  there  has  been  growing  interest  in  peer-to-peer 
storage  systems  comprised  of  potentially  untrusted  nodes. 
Farsite  [2]  investigated  the  possibility  of  spreading  such  a 
file  system  across  people’s  unreliable  desktop  machines. 
Several  new  distributed  hash  tables  such  as  Chord  [26] 
and  Pastry  [24]  show  the  potential  to  scale  to  millions 
of  separately  administered  volunteer  nodes,  with  CFS  [5] 
layering  a  read-only  file  system  on  top  of  such  a  highly 
distributed  architecture.  These  systems  could  potentially 
replace  the  SUNDR  block  server,  provide  automated  off¬ 
site  backup,  or  be  used  to  coordinate  cooperative  caching 
amongst  mutually  distrustful  clients. 

A  number  of  file  systems  in  the  past  have  used  cryp¬ 
tographic  storage  to  keep  data  secret  in  the  event  of  a 
server  compromise.  The  swallow  [21]  distributed  file  sys¬ 
tem  used  client-side  cryptography  to  enforce  access  con¬ 
trol.  Clients  encrypted  files  before  writing  them  to  the 
server.  Any  client  could  read  any  file,  but  could  only  de¬ 
crypt  the  file  given  the  appropriate  key.  Unfortunately, 
one  could  not  grant  read-only  access  to  a  file.  An  attacker 


with  read  access  could,  by  controlling  the  network  or  file 
server,  substitute  arbitrary  data  for  any  version  of  a  file. 

CFS  [1]  allows  users  to  keep  directories  of  files  that  get 
transparently  encrypted  before  being  written  to  disk.  CFS 
does  not  allow  sharing  of  files  between  users,  nor  does 
it  guarantee  freshness  or  integrity  of  data.  It  is  intended 
for  users  to  protect  their  most  sensitive  files  from  prying 
eyes,  not  as  a  general-purpose  file  system.  Cepheus  [8] 
adds  integrity  and  file  sharing  to  a  CFS-like  file  system, 
but  trusts  the  server  for  the  integrity  of  read-shared  data. 
SNAD  [16]  can  use  digital  signatures  for  integrity,  but 
does  not  guarantee  freshness.  PFS  [25]  is  an  elegant 
scheme  for  checking  the  integrity  of  a  file  system  stored 
on  an  untrusted  disk.  With  minor  modifications,  PFS 
could  make  strong  freshness  guarantees.  However,  PFS  is 
really  a  local  file  system  designed  to  reside  on  untrusted, 
potentially  remote  disks.  Users  on  multiple  clients  cannot 
simultaneously  access  the  same  file  system.  Plutus  [12] 
is  secure  file  sharing  system  for  untrusted  storage  which 
focuses  primarily  on  data  secrecy.  The  authors  vaguely 
mention  some  ideas  for  protecting  integrity,  but  have  not 
implemented  them  and  do  not  have  a  proof  of  security.  We 
suspect  their  system  cannot  provide  consistency. 

The  Byzantine  fault-tolerant  file  system,  BFS  [3],  uses 
replication  to  ensure  the  integrity  of  a  network  file  sys¬ 
tem.  As  long  as  more  than  2/3  of  a  server’s  replicas  are 
uncompromised,  any  data  read  from  the  file  system  will 
have  been  written  by  a  legitimate  user.  SUNDR,  in  con¬ 
trast,  does  not  require  any  replication  or  place  any  trust 
in  machines  other  than  a  user’s  client.  If  data  is  repli¬ 
cated  in  SUNDR,  only  one  replica  need  be  honest  for  the 
file  system  to  function  properly.  However,  SUNDR  pro¬ 
vides  weaker  freshness  guarantees  than  BFS,  because  of 
the  possibility  that  a  malicious  SUNDR  server  can  fork 
the  file  system  state  if  users  have  no  other  evidence  of 
each  other’s  on-line  activity. 

Pond  [22]  is  a  prototype  of  the  OceanStore  system,  a 


13 


large-scale  data  store  that  offers  certain  consistency  and 
privacy  guarantees.  Pond  trusts  an  “inner  core”  of  Byzan¬ 
tine  fault-tolerant  servers  to  reliably  order  user  file  op¬ 
erations.  Pond  uses  “heartbeat”  messages,  distributed 
throughout  a  peer-to-peer  network  of  “secondary  repli¬ 
cas”  to  convey  a  loose  notion  of  file  freshness  to  its 
clients.  If  a  client  desires  a  stronger  assurance  of  a 
file’s  freshness,  it  must  ask  the  inner  core  of  servers  to 
sign  a  cryptographic  nonce.  Like  BFS,  this  provides 
stronger  freshness  than  SUNDR  can  achieve  without  on¬ 
line  trusted  parties.  However,  signatures  are  very  ex¬ 
pensive  for  the  servers  in  OceanStore,  considering  they 
must  generate  a  threshold  signature,  which  the  authors 
report  consumes  an  order  of  magnitude  more  computa¬ 
tional  time  than  a  conventional  signature.  Pond  cannot 
very  well  increase  the  number  of  core  servers  to  distribute 
load  amongst  more  servers,  as  this  will  only  make  mat¬ 
ters  worse:  the  complexity  of  its  consistency  protocol  is 
quadratic  in  the  number  of  primary  servers.  In  SUNDR, 
by  contrast,  the  consistency  server  only  verifies  signa¬ 
tures.  File  consistency  is  provably  guaranteed  by  the 
clients,  who  sign  with  a  standard  and  computationally 
practical  signature  scheme. 

SUNDR  uses  hash  trees,  introduced  in  [15],  to  verify  a 
file  block’s  integrity  without  touching  the  entire  file  sys¬ 
tem.  Duchamp  [6],  BFS[3],  SFSRO  [9]  and  TDB  [13] 
have  all  made  use  of  hash  trees  for  comparing  data  or 
checking  the  integrity  of  part  of  a  larger  collection  of  data. 

7  Conclusions 

SUNDR  is  a  general-purpose,  multi-user  network  file  sys¬ 
tem  that  never  presents  applications  with  incorrect  file 
system  state,  even  when  data  is  stored  on  an  untrusted 
server.  SUNDR’s  protocol  provably  guarantees  data  in¬ 
tegrity  and  consistency  without  assuming  any  on-line 
trusted  parties.  By  reducing  the  amount  of  trust  required 
for  file  servers,  SUNDR  both  increases  people’s  options 
for  managing  data  and  improves  their  file  security.  Per¬ 
formance  measurements  of  our  implementation  show  that 
while  SUNDR’s  security  comes  at  a  cost,  the  protocol  is 
still  practical.  Our  prototype  implementation  gives  perfor¬ 
mances  within  a  reasonable  factor  of  NFS  and  will  most 
likely  improve  significantly  as  we  implement  proposed 
optimizations. 
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